Abstract

The Euclidean $k$-means problem is a classical problem that has been extensively studied in the
theoretical computer science, machine learning and the computational geometry communities. In this
problem, we are given a set of $n$ points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ center
points in $\mathbb{R}^d$ so that the sum of squared distances of each point to its nearest center is minimized. The
best approximation algorithms for this problem include a polynomial time constant factor approximation for
general $k$ and a $(1+\epsilon)$-approximation which runs in time $\mathrm{poly}(n)\exp(k/\epsilon)$. At the other extreme, the
only known computational complexity result for this problem is NP-hardness [1]. The main difficulty in
obtaining hardness results stems from the Euclidean nature of the problem, and the fact that any point
in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open the intriguing possibility that the
problem might admit a PTAS for all $k$, $d$.

In this paper we provide the first hardness of approximation for the Euclidean $k$-means problem.
Concretely, we show that there exists a constant $\epsilon > 0$ such that it is NP-hard to approximate the
$k$-means objective to within a factor of $(1+\epsilon)$. We show this via an efficient reduction from the vertex cover
problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the fewest number of
vertices which are incident on all the edges. Additionally, we give a proof that the current best hardness
results for vertex cover can be carried over to triangle-free graphs. To show this we transform $G$, a known
hard vertex cover instance, by taking a graph product with a suitably chosen graph $H$, and showing that
the size of the (normalized) maximum independent set is almost exactly preserved in the product graph
using a spectral analysis, which might be of independent interest.


1 Introduction 

Clustering is the task of partitioning a set of items, such as web pages or protein sequences, into groups of
related items. This is a fundamental task in machine learning, information retrieval, computational geometry,
computer vision, data visualization and many other domains. In many applications, clustering is used
as a first step toward other fine grained tasks such as classification. The problem of clustering
has received significant attention over the years and there is a large body of work on both the applied and
the theoretical aspects of the problem. A common way to approach
the task of clustering is to map the set of items into a metric space where distances correspond to how
different two items are from each other. Using this distance information, one then tries to optimize an
objective function to get the desired clustering. Among the most commonly used objective functions in
the clustering literature is the $k$-means objective function. In the $k$-means problem, the input is a set $S$ of
$n$ data points in Euclidean space and the goal is to choose $k$ center points $C^* = \{c_1, c_2, \ldots, c_k\}$ from $\mathbb{R}^d$
so as to minimize $\Phi = \sum_{x \in S} \|x - c(x)\|^2$, where $c(x) \in C^*$ is the center closest to $x$. Aside from being
a natural clustering objective, an important motivation for studying this objective function stems from the
fact that a very popular and widely used heuristic (appropriately called the $k$-means heuristic [55]) attempts
to minimize this $k$-means objective function.
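As a small illustration of the objective $\Phi$ defined above, the following sketch evaluates the $k$-means cost of a given set of centers. The function name `kmeans_cost` and the toy data are illustrative assumptions, not part of the paper.

```python
def kmeans_cost(points, centers):
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    def sq_dist(x, c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Toy instance (hypothetical data): two well separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(points, centers))  # every point is at distance 1 from its center
```

Note that the centers are arbitrary points of the space, not necessarily data points, which is exactly the feature that makes hardness proofs difficult.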

While the $k$-means heuristic is very much tied to the $k$-means objective function, there are many examples
where it converges to a solution which is far away from the optimal $k$-means solution. This raises the
important question of whether there exist provable algorithms for the $k$-means problem in general Euclidean
space, which is the focus of our paper. Unfortunately, the approximability of the problem
is not very well understood. From the algorithmic side, there has been much focus on getting $(1+\epsilon)$-approximations
that run as efficiently as possible. Indeed, for fixed $k$, Euclidean $k$-means admits a PTAS [55].
These algorithms have exponential dependence in $k$, but only linear dependence in the number of points
and the dimensionality of the space. As mentioned above, there is also empirical and theoretical evidence
for the effectiveness of very simple heuristics for this problem [33, 28]. For arbitrary $k$ and $d$, the best
known approximation algorithm for $k$-means achieves a factor of $9 + \epsilon$. In contrast to the above body
of work on getting algorithms for $k$-means, lower bounds for $k$-means have remained elusive. In fact, until
recently, even NP-hardness was not known for the $k$-means objective. This is perhaps due to the
fact that, as opposed to many discrete optimization problems, the $k$-means problem allows one to choose any
point in the Euclidean space as a center. The above observations lead to the following intriguing possibility:

Is there a PTAS for Euclidean k-means for arbitrary k and dimension d? 

In this paper we answer this question in the negative and provide the first hardness of approximation for
the Euclidean $k$-means problem.

Theorem 1. There exists a constant $\epsilon > 0$ such that it is NP-hard to approximate the Euclidean $k$-means
to a factor better than $(1+\epsilon)$.

The starting point for our reduction is the Vertex-Cover problem on triangle-free graphs: here, given a
triangle-free graph, the goal is to choose the fewest number of vertices which are incident on all the edges
in the graph. This naturally leads us to our other main result in this paper, that of showing hardness
of approximation of vertex cover on triangle-free graphs. Kortsarz et al. [24] show that if the vertex cover
problem is hard to approximate to a factor of $\alpha \ge 3/2$, then it is hard to approximate vertex cover on
triangle-free graphs to the same factor of $\alpha$. While such a hardness (in fact, a factor of $2 - \epsilon$ [55]) is known assuming
the stronger unique games conjecture, the best known NP-hardness results do not satisfy $\alpha \ge 3/2$. We settle
this question by showing NP-hardness results for approximating vertex cover on triangle-free graphs, which
match the best known hardness on general graphs.

Theorem 2. It is NP-hard to approximate Vertex Cover on triangle-free graphs to within any factor smaller 
than 1.36. 


2 Main Technical Contribution 

In Section 4 we show a reduction from Vertex-Cover on triangle-free graphs to Euclidean $k$-means where the
vertex cover instances have small cover size if and only if the corresponding $k$-means instances have a low cost.
A crucial ingredient is to relate the cost of the clusters to the structural properties of the original graph, which
lets us transition from the Euclidean problem to a completely combinatorial problem. Then in Section 5 we
prove that the known hardness of approximation results for Vertex-Cover carry over to triangle-free graphs.
This improves over existing hardness results for vertex cover on triangle-free graphs [24]. Furthermore, we
believe that our proof techniques are of independent interest. Specifically, our reduction transforms known
hard instances $G$ of vertex cover, by taking a graph product with an appropriately chosen graph $H$. We
then show that the size of the vertex cover in the new graph (in proportion to the size of the graph) can be
related to spectral properties of $H$. In fact, by choosing $H$ to have a bounded spectral radius, we show that
the vertex covers in $G$ and the product graph are roughly preserved, while also ensuring that the product
graph is triangle-free. Combining this with our reduction to $k$-means completes the proof.

3 Related Work


Arthur and Vassilvitskii proposed $k$-means++, a random sampling based approximation algorithm for
Euclidean $k$-means which achieves a factor of $O(\log k)$. This was improved by Kanungo et al. [H] who
proposed a local search based algorithm which achieves a factor of $(9 + \epsilon)$. This is currently the best
known approximation algorithm for $k$-means. For fixed $k$ and $d$, Matousek gave a PTAS for $k$-means
which runs in time $O(n \epsilon^{-2k^2 d} \log^k n)$. Here $n$ is the number of points and $d$ is the dimensionality of the
space. This was improved by Badoiu et al. [7] who gave a PTAS for fixed k and any d with run time 
log* n). Kumar et al. [5S] gave an improved PTAS with exponential dependence in k 
and only linear dependence in $n$ and $d$. Feldman et al. [16] combined this with efficient coreset constructions
to give a PTAS for fixed $k$ with improved dependence on $k$. The work of Dasgupta [n] and Aloise et al. [I]
showed that Euclidean $k$-means is NP-hard even for $k = 2$. Mahajan et al. [30] also show that the $k$-means
problem is NP-hard for points in the plane.

There are also many other clustering objectives related to $k$-means which are commonly studied. The
most relevant to our discussion are the $k$-median and the $k$-center objectives. In the first problem, the
objective is to pick $k$ centers to minimize the sum of distances of each point to the nearest center (note that
the distances are not squared). The problem deviates from $k$-means in two crucial aspects, both owing to the
different contexts in which the two problems are studied: (i) the $k$-median problem is typically studied in the
setting where the centers are one of the data points (or come from a set of possible centers specified in the
input), and (ii) the problem is also very widely studied on general metrics, without the Euclidean restriction.
The $k$-median problem has been a testbed for developing new techniques in approximation algorithms, and has
constantly seen improvements even until very recently [20, 27]. Currently, the best known approximation
for $k$-median is a factor of $2.611 + \epsilon$ due to Byrka et al. [9]. On the other hand, it is also known that the
$k$-median objective (on general metrics) is NP-hard to approximate to a factor better than $(1 + 1/e)$ [19]. When
restricted to Euclidean metrics, Kolliopoulos et al. [23] show a PTAS for $k$-median on constant dimensional
spaces. On the negative side for $k$-median on Euclidean metrics, it is known that the discrete problem (where
centers come from a specified input) cannot have a PTAS under standard complexity assumptions [17]. As
mentioned earlier, all these results are for the version where the set of possible candidate centers is specified in the
input. For the problem where any point can be a center, Arora et al. [1] show a PTAS when the points are
on a 2-dimensional plane.

In the $k$-center problem the objective is to pick $k$ center points such that the maximum distance of
any data point to the closest center point is minimized. In general metrics, this problem admits a 2-factor
approximation which is also optimal assuming P $\neq$ NP [18]. For Euclidean metrics, when the center could be
any point in the space, the upper bound is still 2 and the best hardness of approximation is a factor 1.82.


4 Our Hardness Reduction: From Vertex Cover to Euclidean k-means

In this section, we show a reduction from the Vertex-Cover problem (on triangle-free graphs) to the $k$-means
problem. Formally, the vertex cover problem can be stated as follows: Given an undirected graph $G = (V, E)$,
choose a subset $S$ of vertices (with minimum $|S|$) such that $S$ is incident on every edge of the graph. More
specifically, our reduction establishes the following theorem.

Theorem 3. There is an efficient reduction from instances of Vertex Cover (on triangle-free graphs) to
those of Euclidean $k$-means that satisfies the following properties:

(i) if the Vertex Cover instance has value $k$, then the $k$-means instance has cost at most $m - k$.

(ii) if the Vertex Cover instance has value at least $k(1 + \epsilon)$, then the optimal $k$-means cost is at least
$m - (1 - \Omega(\epsilon))k$. Here, $\epsilon$ is some fixed constant $> 0$.

In Section 5 we show that there exist triangle-free graph instances of vertex cover on $m = \Theta(n)$ edges, and
$k = \Omega(n)$ such that it is NP-hard to distinguish if the instance has a vertex cover of size at most $k$, or all
vertex covers have size at least $(1 + \epsilon)k$, for some constant $\epsilon > 0$.

Now, let $k = m/\Delta$ where $\Delta = \Omega(1)$ from the hard vertex cover instances. Then, from Theorem 3 we get that if
the vertex cover has value $k$, then the $k$-means cost is at most $m(1 - \frac{1}{\Delta})$, and if the vertex cover is at least
$k(1 + \epsilon)$, then the optimal $k$-means cost is at least $m(1 - \frac{1 - \Omega(\epsilon)}{\Delta})$. Therefore, the vertex cover hardness says
that it is also NP-hard to distinguish if the resulting $k$-means instance has cost at most $m(1 - \frac{1}{\Delta})$ or cost
more than $m(1 - \frac{1 - \Omega(\epsilon)}{\Delta})$. Since $\Delta$ is a constant, this implies that it is NP-hard to approximate the $k$-means
problem within some factor $(1 + \Omega(\epsilon))$, thereby establishing our main result, Theorem 1. In what follows, we prove Theorem 3.

4.1 Proof of Theorem 3

Let $G = (V, E)$ denote the graph in the Vertex Cover instance $\mathcal{I}$, with parameter $k$ denoting the number of
vertices we can select. We associate the vertices with natural numbers $[n]$. Therefore, we refer to vertices by
natural numbers $i$, and edges by pairs of natural numbers $(i, j)$.

Construction of $k$-means Instance $\mathcal{I}_{km}$. For each vertex $i \in [n]$, we have a unit vector $x_i = (0, 0, \ldots, 1, \ldots, 0)$
which has a 1 in the $i$-th coordinate and 0 elsewhere. Now, for each edge $e \equiv (i, j)$, we have a vector
$x_e := x_i + x_j$. The data points on which we solve the $k$-means problem are precisely $\{x_e : e \in E\}$. This
completes the definition of $\mathcal{I}_{km}$.
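The construction above can be sketched in a few lines. This is a minimal illustration, assuming vertices are numbered from 0 (the paper numbers them from 1) and using the hypothetical helper name `kmeans_instance`.

```python
def kmeans_instance(n, edges):
    """Return one 0/1 vector in R^n per edge (i, j), namely x_e = x_i + x_j."""
    points = []
    for i, j in edges:
        x = [0] * n
        x[i] = 1  # contribution of the unit vector x_i
        x[j] = 1  # contribution of the unit vector x_j
        points.append(x)
    return points


# Path on three vertices 0 - 1 - 2 (assumed example graph).
path_edges = [(0, 1), (1, 2)]
for x in kmeans_instance(3, path_edges):
    print(x, "squared norm:", sum(t * t for t in x))
```

Every data point has exactly two nonzero coordinates, so its squared norm is 2, a fact used later in the cost analysis.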

Note 4. As stated, the dimensionality of the points we have constructed is $n$, and we get a hardness factor of
$(1 + \epsilon)$. However, by using the dimensionality reduction ideas of Johnson and Lindenstrauss,
without loss of generality, we can assume that the points lie in $O(\log n/\epsilon^2)$ dimensions and our hardness
results still hold true. This is because, after the transformation, all pairwise distances (and in particular,
the $k$-means objective function) are preserved up to a factor of $(1 + \epsilon/10)$ of the original values, and so our
hardness factor is also (almost) preserved, i.e., we would get hardness of approximation of $(1 + O(\epsilon))$.

However, for simplicity, we stick with the $n$ dimensional vectors as it makes the presentation much
cleaner.

4.2 Completeness 

Suppose $\mathcal{I}$ is such that there exists a vertex cover $S^* = \{v_1, v_2, \ldots, v_k\}$ of $k$ vertices which can cover all
the edges. We will now show that we can recover a good clustering of low $k$-means cost. To this end, let
$E_{v_\ell}$ denote the set of edges which are covered by $v_\ell$ for $1 \le \ell \le k$. If an edge is covered by two vertices,
we assume that only one of them covers it. As a result, note that the $E_{v_\ell}$'s are pairwise disjoint (and their
union is $E$), and each $E_{v_\ell}$ is of the form $\{(v_\ell, w_{\ell,1}), (v_\ell, w_{\ell,2}), \ldots, (v_\ell, w_{\ell,p_\ell})\}$.

Now, to get our clustering, we do the following: for each $v \in S^*$, form a cluster out of the data points
$\mathcal{F}_v := \{x_e : e \in E_v\}$. We now analyze the average connection cost of this solution. To this end, we begin
with some easy observations about the $k$-means clustering. Indeed, since any cluster is a set of data points
(corresponding to a subset of edges in the graph $G$), we shall abuse notation and associate any cluster $\mathcal{F}$ also
with the corresponding subgraph on $V$, i.e., $\mathcal{F} \subseteq E$. Moreover, we use $d_{\mathcal{F}}(i)$ to denote the degree of node $i$
in $\mathcal{F}$ and $m_{\mathcal{F}}$ to denote the number of edges in $\mathcal{F}$, $m_{\mathcal{F}} = |\mathcal{F}|$. Finally, we denote by $d_G(i)$ the degree of vertex
$i$ in $G$.

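Claim 5. Given any clustering $\{\mathcal{F}\}$, the following hold.

(i) $\sum_{\mathcal{F}} d_{\mathcal{F}}(i) = d_G(i)$.

(ii) $\sum_{\mathcal{F}} \sum_{i} d_{\mathcal{F}}(i) = 2m = 2|E|$.

Proof. The proof is immediate, because every edge $e \in E$ belongs to exactly one cluster in $\{\mathcal{F}\}$. □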
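
Our next claim relates the connection cost of any cluster $\mathcal{F}$ to the structure of the associated subgraph,
which forms the crucial part of the analysis.

Claim 6. The total connection cost of any cluster $\mathcal{F}$ is:

\[ \sum_{i} d_{\mathcal{F}}(i) \left( 1 - \frac{d_{\mathcal{F}}(i)}{m_{\mathcal{F}}} \right). \]

Proof. Firstly, note that $\sum_i d_{\mathcal{F}}(i) = 2 m_{\mathcal{F}}$. Now consider the center $\mu_{\mathcal{F}}$ of cluster $\mathcal{F}$. By definition, we have
that at coordinate $i \in V$:

\[ \mu_{\mathcal{F}}(i) = \frac{1}{m_{\mathcal{F}}} \sum_{e \in \mathcal{F} : i \in e} 1 = \frac{d_{\mathcal{F}}(i)}{m_{\mathcal{F}}}. \]

So $\|\mu_{\mathcal{F}}\|^2 = \frac{1}{m_{\mathcal{F}}^2} \sum_i d_{\mathcal{F}}(i)^2$. Hence the total cost of this clustering is:

\[ \sum_{e \in \mathcal{F}} \|x_e - \mu_{\mathcal{F}}\|^2 = \sum_{e \in \mathcal{F}} \|x_e\|^2 - m_{\mathcal{F}} \|\mu_{\mathcal{F}}\|^2 = 2 m_{\mathcal{F}} - \frac{1}{m_{\mathcal{F}}} \sum_i d_{\mathcal{F}}(i)^2 = \sum_i d_{\mathcal{F}}(i) - \frac{1}{m_{\mathcal{F}}} \sum_i d_{\mathcal{F}}(i)^2. \]

The first equality here uses the fact that $m_{\mathcal{F}} \mu_{\mathcal{F}} = \sum_{e \in \mathcal{F}} x_e$ and the second equality uses the fact that
$\|x_e\|^2 = 2$ for each data point. □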
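The identity in Claim 6 can be checked numerically on small clusters. The sketch below compares the cost computed directly from the centroid with the degree formula, using exact rational arithmetic; the function names and the 0-based vertex numbering are assumptions made for illustration.

```python
from fractions import Fraction


def cluster_cost_direct(n, edges):
    """Sum of squared distances of the edge vectors x_e to their centroid."""
    m = len(edges)
    mu = [Fraction(0)] * n
    for i, j in edges:
        mu[i] += Fraction(1, m)
        mu[j] += Fraction(1, m)
    cost = Fraction(0)
    for i, j in edges:
        for t in range(n):
            x_t = 1 if t in (i, j) else 0
            cost += (x_t - mu[t]) ** 2
    return cost


def cluster_cost_formula(n, edges):
    """Degree formula of Claim 6: sum_i d(i) * (1 - d(i) / m)."""
    m = len(edges)
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return sum(Fraction(d) * (1 - Fraction(d, m)) for d in deg)


star = [(0, 1), (0, 2), (0, 3)]   # star with 3 edges
path = [(0, 1), (1, 2), (2, 3)]   # path with 3 edges
for cluster in (star, path):
    print(cluster_cost_direct(4, cluster), cluster_cost_formula(4, cluster))
```

On these two examples both computations agree, and the star is cheaper than the path with the same number of edges, which is the phenomenon the soundness analysis exploits.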


Claim 7. There exists a clustering of our $k$-means instance $\mathcal{I}_{km}$ with cost at most $m - k$, where $m$ is the
number of edges in the graph $G = (V, E)$ associated with the vertex cover instance $\mathcal{I}$, and $k$ is the size of the
optimal vertex cover.

Proof. Consider a cluster $\mathcal{F}_v$, which consists of data points associated with edges covered by a single vertex
$v$. Then, by Claim 6, the connection cost of this cluster is precisely $m_v - 1$, since the sub-graph associated with a
cluster is simply a star rooted at $v$. Here, $m_v$ is the number of edges which $v$ covers in the vertex cover (if
an edge is covered by different vertices in the cover, it is included in only one vertex). Then, summing over
all clusters, we get the claim. □
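To make the completeness argument concrete, the sketch below assigns every edge to one covering vertex and sums the Claim 6 cost over the resulting star clusters. The helper names and the example path graph are assumptions chosen for illustration.

```python
from fractions import Fraction


def star_clusters(edges, cover):
    """Assign each edge to exactly one endpoint that lies in the cover."""
    clusters = {v: [] for v in cover}
    for i, j in edges:
        owner = i if i in clusters else j  # KeyError here means `cover` misses this edge
        clusters[owner].append((i, j))
    return clusters


def cost(edges):
    """Claim 6 cost of one cluster, computed from its degree sequence."""
    m = len(edges)
    deg = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    return sum(Fraction(d) * (1 - Fraction(d, m)) for d in deg.values())


# Path 0 - 1 - 2 - 3 - 4 with vertex cover {1, 3}: m = 4 edges, k = 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
clusters = star_clusters(edges, cover=[1, 3])
total = sum(cost(c) for c in clusters.values())
print(total, len(edges) - len(clusters))
```

Each nonempty star with $m_v$ edges contributes $m_v - 1$, so the total on this example equals $m - k$.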


4.3 Soundness 


In this section, we show that if there is a clustering of low $k$-means cost, then there is a very good vertex
cover for the corresponding graph. We begin with some useful notation.

Notation 8. Given a set $E' \subseteq \binom{V}{2}$ of $m_{E'} = |E'|$ edges with corresponding node degrees $(d_1, \ldots, d_n)$, we
define $\mathrm{Cost}(E')$ as the following:

\[ \mathrm{Cost}(E') := \sum_{u \in V} d_u \left( 1 - \frac{d_u}{m_{E'}} \right). \]

Note that, by Claim 6, the connection cost of a clustering $\mathcal{F} = \{\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_k\}$ of the data points is equal to
$\sum_i \mathrm{Cost}(\mathcal{F}_i)$. Recall that we abuse notation slightly and view each cluster $\mathcal{F}_i$ of the data points also as a
subset of $E$. Moreover, because $\mathcal{F}$ clusters all points, the subgraphs $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_k$ form a partition of $E$.
Using this analogy, we study the properties of each subgraph and show that if the $k$-means cost of $\mathcal{F}$ is small,
then most of these subgraphs in fact are stars. This will in turn help us recover a small vertex cover for $G$.
We begin with a simple property of $\mathrm{Cost}(E')$.


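Proposition 9. For any set of $m_{E'}$ edges $E'$, $m_{E'} - 1 \le \mathrm{Cost}(E') \le 2m_{E'} - 1$.

Proof. We have $\mathrm{Cost}(E') = \sum_u d_u \left(1 - \frac{d_u}{m_{E'}}\right) = 2m_{E'} - \frac{1}{m_{E'}} \sum_u d_u^2$. The proof follows from noting that
$\frac{1}{m_{E'}} \sum_u d_u^2 \ge 2$ and $\frac{1}{m_{E'}} \sum_u d_u^2 \le m_{E'} + 1$. The last inequality is due to the fact that $\sum_{u \in V} d_u^2$ is
maximized by the degree sequence $(m_{E'}, 1, 1, \ldots, 1)$. □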
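The bounds of Proposition 9 can be verified exhaustively on very small vertex sets. The following brute-force sketch enumerates every nonempty edge set on $n$ vertices; the function names are illustrative assumptions and the enumeration is only feasible for tiny $n$.

```python
from fractions import Fraction
from itertools import combinations


def cost(edges):
    """Cost(E') = sum_u d_u * (1 - d_u / m) for a nonempty edge set E'."""
    m = len(edges)
    deg = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    return sum(Fraction(d) * (1 - Fraction(d, m)) for d in deg.values())


def check_bounds(n):
    """Check m - 1 <= Cost(E') <= 2m - 1 for every nonempty simple edge set on n vertices."""
    all_edges = list(combinations(range(n), 2))
    for r in range(1, len(all_edges) + 1):
        for subset in combinations(all_edges, r):
            m = len(subset)
            if not (m - 1 <= cost(subset) <= 2 * m - 1):
                return False
    return True


print(check_bounds(5))
```

The lower bound is attained by stars, and, as Lemma 11 shows, otherwise only by the triangle.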

Theorem 10. If the $k$-means instance $\mathcal{I}_{km}$ has a clustering $\mathcal{F} = \{\mathcal{F}_1, \ldots, \mathcal{F}_k\}$ with $\mathrm{Cost}(\mathcal{F}) \le m - (1 - \delta)k$,
then there exists a $(1 + O(\delta))k$-vertex cover of $G$ in the instance $\mathcal{I}$.

Note that this, along with Claim 7, would complete the proof of Theorem 3.

Proof. For each $i \in [k]$, let $m_i := |\mathcal{F}_i|$ and let $d_u$ denote the degree of vertex $u$ in $\mathcal{F}_i$. Note that
$\mathrm{Cost}(\mathcal{F}_i) = 2m_i - \frac{1}{m_i} \sum_u d_u^2$. By Proposition 9, each
$i \in [k]$ satisfies $m_i - 1 \le \mathrm{Cost}(\mathcal{F}_i) \le 2m_i - 1$. Hence if we define $\delta_i$ as $\delta_i := \mathrm{Cost}(\mathcal{F}_i) - (m_i - 1)$, then
$0 \le \delta_i \le m_i$. Moreover $\frac{1}{m_i} \sum_u d_u^2 = m_i + 1 - \delta_i$. Thus:

\[ m - (1 - \delta)k \ge \sum_i \mathrm{Cost}(\mathcal{F}_i) = \sum_i (\delta_i + m_i - 1) = \sum_i \delta_i + m - k \implies \delta k \ge \sum_i \delta_i. \]

This means that, except for at most $2\delta k$ clusters, the remaining clusters all have $\delta_i < \frac{1}{2}$. Moreover, Lemma 11 implies all these
$(1 - 2\delta)k$ clusters are either stars or triangles and have $\delta_i = 0$. Since the graph is triangle-free, they are
all stars, and hence the corresponding center vertices cover all the edges in the respective clusters. It now
remains to cover the edges in the remaining $2\delta k$ clusters which have larger $\delta_i$ values. Indeed, even for these
clusters, we can appeal to Lemma 11 and choose two vertices per cluster to cover all but $\delta_i$ edges in each cluster.
So the size of our candidate vertex cover is at most $k(1 + 2\delta)$, and we have covered all but $\sum_i \delta_i$ edges. But
now, we notice that $\sum_i \delta_i \le \delta k$, and so we can simply include one vertex per uncovered edge and would
obtain a vertex cover of size at most $k(1 + 3\delta)$, thus completing the proof. □
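The covering step in the proof of Theorem 10 can be sketched as a simple procedure: take the center of every star cluster, take the two endpoints of the best edge in every other cluster, and patch any edge that is still uncovered with one extra vertex. This is only a sketch under assumptions (the function name and the example clustering are hypothetical, and the code does not track the $\delta_i$ values used in the size analysis).

```python
def cover_from_clustering(clusters):
    """Build a vertex cover of the union of the clusters, following the proof sketch."""
    cover = set()
    for edges in clusters:
        if not edges:
            continue
        deg = {}
        for i, j in edges:
            deg[i] = deg.get(i, 0) + 1
            deg[j] = deg.get(j, 0) + 1
        m = len(edges)
        center = max(deg, key=deg.get)
        if deg[center] == m:
            # Star cluster: its center alone covers every edge of the cluster.
            cover.add(center)
            continue
        # Otherwise take both endpoints of an edge {u, v} maximising d_u + d_v.
        u, v = max(edges, key=lambda e: deg[e[0]] + deg[e[1]])
        cover.update((u, v))
        # Patch the few remaining edges with one extra vertex each.
        for i, j in edges:
            if i not in cover and j not in cover:
                cover.add(i)
    return cover


# Assumed example: one star cluster and one path cluster.
clusters = [[(0, 1), (0, 2)], [(3, 4), (4, 5), (5, 6)]]
print(sorted(cover_from_clustering(clusters)))
```

On this example the star contributes its center 0 and the path contributes the two endpoints of its middle edge.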


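Lemma 11. Given a graph $G_{\mathcal{F}} = (V, \mathcal{F})$ with $m = |\mathcal{F}|$ edges and degrees $(d_1, \ldots, d_n)$, let $\delta$ be such that

\[ \frac{1}{m} \sum_u d_u^2 = m + 1 - \delta. \]

There always exists an edge $\{u, v\} \in \mathcal{F}$ with $d_u + d_v \ge m + 1 - \delta$. Furthermore, if $\delta < \frac{1}{2}$, then $\delta = 0$ and
$G_{\mathcal{F}}$ is either a star graph or a triangle.

Proof. Since $\sum_u d_u^2 = \sum_{\{u,v\} \in \mathcal{F}} (d_u + d_v)$, we can think of $\frac{1}{m} \sum_u d_u^2$ as the expectation of $d_u + d_v$ over a
random edge $\{u, v\} \in \mathcal{F}$ chosen uniformly:

\[ \frac{1}{m} \sum_u d_u^2 = \mathbb{E}_{\{u,v\} \sim \mathcal{F}} \left[ d_u + d_v \right]. \]

From this, we can immediately conclude the existence of an edge $\{u, v\}$ with $d_u + d_v \ge m + 1 - \delta$. Now to
complete the second part of the Lemma statement, suppose $d_u \ge d_v$. The number of edges incident to $\{u, v\}$
is:

\[ d_u + d_v - 1 \ge m - \delta \implies d_u + d_v - 1 = m. \]

So all edges are incident to $u$ or $v$, and $d_w \le 2$ if $w \notin \{u, v\}$. If $d_v \le 1$, then we are done. In the other
case, we have $d_u \ge d_v \ge 2 \ge d_w$ for all $w \notin \{u, v\}$. Let $\alpha := d_u$ and $\beta := d_v$. The degree sequence $(d_1, \ldots, d_n)$ is
strongly majorized by the following sequence $d'$:

\[ d' = (\alpha, \beta, \underbrace{1, \ldots, 1}_{\alpha - \beta \text{ many}}, \underbrace{2, \ldots, 2}_{\beta - 1 \text{ many}}). \]

Since $\sum_u d_u^2$ is Schur-convex, its value increases under majorization:

\[ (\alpha + \beta - 1)(\alpha + \beta - \delta) = m(m + 1 - \delta) \le \sum_u d_u^2 \le \sum_u {d'_u}^2 = \alpha^2 + \beta^2 + 4(\beta - 1) + (\alpha - \beta). \]

Rearranging,

\[ 0 \le (\alpha + \beta - 1)\delta + 2\alpha + 4\beta - 4 - 2\alpha\beta = (\alpha + \beta - 1)\delta + 2\alpha(1 - \beta) + 4(\beta - 1). \]

So we obtain $2\alpha(\beta - 1) \le (\alpha + \beta - 1)\delta + 4(\beta - 1)$. Since $\beta \ge 2$, we divide both sides by $\beta - 1$:

\[ 2\alpha \le \frac{\alpha}{\beta - 1}\delta + 4 + \delta \le \delta\alpha + 4 + \delta. \]

In particular, $(2 - \delta)\alpha \le 4 + \delta \implies \alpha \le \frac{4 + \delta}{2 - \delta} < 3$ as $\delta < 1/2$. Hence $\alpha \le 2$. Consequently, $d_u = d_v = 2$ and
$m = d_u + d_v - 1 = 3$. There are two possible cases: the graph is either a 3-cycle or a 4-path. In the latter
case, the corresponding $\delta$ is:

\[ \delta = m + 1 - \frac{1}{m} \sum_u d_u^2 = 4 - \frac{1}{3}(2^2 + 2^2 + 1 + 1) = 4 - \frac{10}{3} = \frac{2}{3}, \]

which is a contradiction, and so the graph is a triangle. □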
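The slack $\delta$ of Lemma 11 is easy to evaluate on the three graphs that appear in the proof. The sketch below uses exact rationals; the function name `delta` and the example graphs are assumptions for illustration.

```python
from fractions import Fraction


def delta(edges):
    """Slack in Lemma 11: delta = m + 1 - (1/m) * sum_u d_u^2."""
    m = len(edges)
    deg = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    return m + 1 - Fraction(sum(d * d for d in deg.values()), m)


star = [(0, 1), (0, 2), (0, 3)]
triangle = [(0, 1), (1, 2), (0, 2)]
path4 = [(0, 1), (1, 2), (2, 3)]
for name, g in (("star", star), ("triangle", triangle), ("4-path", path4)):
    print(name, delta(g))
```

The star and the triangle have zero slack, while the 4-path has slack $2/3$, which exceeds the threshold $1/2$ used in the lemma.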

Putting the pieces together, we get the proof of Theorem 3.

Remark 12. Unique Games Hardness: Khot and Regev show that approximating Vertex-Cover
to factor $(2 - \epsilon)$ is hard assuming the Unique Games conjecture. Furthermore, Kortsarz et al. [24] show
that any approximation algorithm with ratio $\alpha \ge 1.5$ for Vertex-Cover on 3-cycle-free graphs implies an $\alpha$
approximation algorithm for Vertex-Cover (on general graphs). This result combined with the reduction in
this section immediately implies APX hardness for $k$-means under the unique games conjecture. In the next
section we generalize the result of Kortsarz et al. [24] by giving an approximation preserving reduction from
Vertex-Cover on general graphs to Vertex-Cover on triangle-free graphs. This would enable us to get APX
hardness for the $k$-means problem.


5 Hardness of Vertex Cover on Triangle-Free Graphs 


In this section, we show that the Vertex Cover problem is as hard on triangle-free graphs as it is on general
graphs. To this end, for any graph $G = (V, E)$, we define $\mathrm{IS}(G)$ as the size of a maximum independent set in
$G$. For convenience, we define $\text{rel-IS}(G)$ as the ratio of $\mathrm{IS}(G)$ to the number of nodes in $G$:

\[ \text{rel-IS}(G) := \frac{\mathrm{IS}(G)}{|V|}. \]

Similarly, let $\mathrm{VC}(G)$ be the size of a minimum vertex cover in $G$ and $\text{rel-VC}(G)$ be the ratio $\frac{\mathrm{VC}(G)}{|V|}$. The
following is well known, and says that independent sets and vertex covers are duals of each other.

Proposition 13. Given $G = (V, E)$, $I \subseteq V$ is an independent set if and only if $C = V \setminus I$ is a vertex cover.
In particular, $\mathrm{IS}(G) + \mathrm{VC}(G) = |V|$.
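The duality in Proposition 13 can be confirmed by brute force on a small graph. The following sketch is exponential in the number of vertices and only meant as an illustration; the function names and the 5-cycle example are assumptions.

```python
from itertools import combinations


def max_independent_set(n, edges):
    """Size of a largest vertex set containing no edge (exhaustive search)."""
    for r in range(n, -1, -1):
        for subset in combinations(range(n), r):
            s = set(subset)
            if all(not (i in s and j in s) for i, j in edges):
                return r


def min_vertex_cover(n, edges):
    """Size of a smallest vertex set touching every edge (exhaustive search)."""
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            s = set(subset)
            if all(i in s or j in s for i, j in edges):
                return r


# 5-cycle (assumed example).
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(max_independent_set(5, cycle5), min_vertex_cover(5, cycle5))
```

For the 5-cycle the two quantities are 2 and 3, which sum to the number of vertices.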


We will prove the following theorem. 

Theorem 14. For any constant $\epsilon > 0$, there is a $(1 + \epsilon)$-approximation-preserving reduction for independent
set from any graph $G = (V, E)$ with maximum degree $\Delta$ to triangle-free graphs with $\mathrm{poly}(\Delta, \epsilon^{-1}) \cdot |V|$ nodes
and degree $\mathrm{poly}(\Delta, \epsilon^{-1})$ in deterministic polynomial time.

Combining Theorem 14 with the best known unconditional hardness result for Vertex Cover, due to Dinur and Safra [14],
we obtain the following corollary.


Corollary 15. Given any unweighted triangle-free graph $G$ with bounded degrees, it is NP-hard to approximate
Vertex Cover within any factor smaller than 1.36.

Given two simple graphs $G = (V_1, E_1)$ and $H = (V_2, E_2)$, we define the Kronecker product of $G$ and $H$,
$G \otimes H$, as the graph with nodes $V(G \otimes H) = V_1 \times V_2$ and edges:

\[ E(G \otimes H) = \left\{ \{(u, i), (v, j)\} \;\middle|\; \{u, v\} \in E(G), \{i, j\} \in E(H) \right\}. \]

Observe that, if $A_G$ and $A_H$ denote the adjacency matrices of $G$ and $H$, then $A_{G \otimes H} = A_G \otimes A_H$.

Given any symmetric matrix $M$, we will use $\sigma_i(M)$ to denote the $i$-th largest eigenvalue of $M$. For any
graph $G$ on $n$ nodes, we define the spectral radius of $G$, $\rho(G)$, as the following:

\[ \rho(G) := \max_{p \perp e} \frac{|p^T A_G p|}{\|p\|^2} = \max(\sigma_2(A_G), |\sigma_n(A_G)|). \]

Here $e$ is the all 1's vector of length $n$.

Proposition 16. If $H$ is triangle-free, then so is $G \otimes H$.

Proof. Suppose $G \otimes H$ has a 3-cycle of the form $((a, i), (b, j), (c, k), (a, i))$. Then $(i, j, k, i)$ is a closed walk in
$H$. Since $H$ is triangle-free, $i = j$ without loss of generality; a contradiction, as $H$ has no loops. □
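Proposition 16 can be illustrated on small graphs. The sketch below builds the Kronecker product as a set of undirected edges and tests for triangles; the function names and the example graphs are assumptions made for illustration.

```python
def kronecker_product(edges_g, edges_h):
    """Edges {(u, i), (v, j)} of the product for {u, v} in E(G) and {i, j} in E(H)."""
    product = set()
    for u, v in edges_g:
        for i, j in edges_h:
            # Both orientations of the undirected edge {i, j} give a product edge.
            product.add(frozenset({(u, i), (v, j)}))
            product.add(frozenset({(u, j), (v, i)}))
    return product


def has_triangle(edges):
    """True if some edge has an endpoint pair with a common neighbour."""
    adj = {}
    for e in edges:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return any(adj[a] & adj[b] for a, b in map(tuple, edges))


k3 = [(0, 1), (1, 2), (0, 2)]          # triangle
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle, triangle-free
print(has_triangle(kronecker_product(k3, c4)))  # G has a triangle, H does not
print(has_triangle(kronecker_product(k3, k3)))  # both factors have a triangle
```

Even though the first factor is a triangle, its product with the triangle-free 4-cycle contains no triangle, whereas the product of two triangles does.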

The following Lemma says that as long as $H$ has good spectral properties, the relative size of maximum
independent sets in $G$ will be preserved by $G \otimes H$.

Lemma 17. Suppose $H$ is a $d$-regular graph with spectral radius $\le \rho$. For any graph $G$ with maximum
degree $\Delta$,

\[ \text{rel-IS}(G \otimes H) \ge \text{rel-IS}(G) \ge \left(1 - \frac{\rho\Delta}{2d}\right) \text{rel-IS}(G \otimes H). \]

Proof. Suppose $V(G) = [n]$ and $V(H) = [N]$. Let $A := A_G$ be the adjacency matrix of $G$ and $B$ be the
normalized adjacency matrix of $H$,

\[ B := \frac{1}{d} A_H. \]

For the lower bound, consider an independent set $I$ in $G$. It is easy to check that $I \times [N]$ is an independent
set in $G \otimes H$, thus $\mathrm{IS}(G \otimes H) \ge N \cdot \mathrm{IS}(G) \implies \text{rel-IS}(G \otimes H) \ge \text{rel-IS}(G)$.

For the upper bound, consider the indicator vector $f \in \{0, 1\}^{[n] \times [N]}$ of an independent set in $G \otimes H$.
Since the corresponding set contains no edges from $G \otimes H$,

\[ f^T (A \otimes B) f = 0. \]

Define $p \in [0, 1]^n$ as the following vector:

\[ p_u := \frac{1}{N} \sum_{j \in [N]} f_{(u, j)}. \]

For each $u \in [n]$, pick $u$ with probability $p_u$. Let $I_0 \subseteq [n]$ be the set of picked nodes. Next, start with $I \leftarrow I_0$.
As long as there is an edge of $G$ contained in $I$, arbitrarily remove one of its endpoints from $I$. At the end
of this process, the remaining set $I$ is an independent set for $G$, and its size is at least the size of $I_0$ minus
the number of edges contained in $I_0$. Hence $|I| \ge |I_0| - |E_G(I_0, I_0)|$. Observe that

\[ \mathbb{E}[|I_0|] = \sum_u p_u = \frac{1}{N} \|f\|^2 \]

(since $f$ is a $\{0, 1\}$ vector, its $\ell_1$ norm is the same as its squared $\ell_2$ norm).

The probability of any pair $u \neq v$ being contained in $I_0$ is given by

\[ \mathrm{Prob}[\{u, v\} \subseteq I_0] = p_u p_v. \]
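
Therefore, the expected number of edges contained in $I_0$ is $\frac{1}{2} p^T A p$:

\[ \mathbb{E}[|E_G(I_0, I_0)|] = \sum_{u < v} A_{uv} p_u p_v = \frac{1}{2} p^T A p. \]

Putting it all together:

\[ \mathbb{E}[|I|] \ge \mathbb{E}[|I_0|] - \mathbb{E}[|E_G(I_0, I_0)|] = \frac{1}{N}\|f\|^2 - \frac{1}{2} p^T A p \ge \frac{1}{N}\|f\|^2 - \frac{1}{2N} \left| f^T A \otimes (J_N - B) f \right| \ge \frac{\|f\|^2}{N} \left(1 - \frac{\rho\Delta}{2d}\right), \]

where the second inequality uses Claims 18 and 19 and the last inequality uses Claim 20.

Therefore, taking $f$ to be the indicator vector of a maximum independent set in $G \otimes H$,

\[ \mathrm{IS}(G) \ge \left(1 - \frac{\rho\Delta}{2d}\right) \frac{\mathrm{IS}(G \otimes H)}{N} \implies \text{rel-IS}(G) \ge \left(1 - \frac{\rho\Delta}{2d}\right) \text{rel-IS}(G \otimes H). \]

□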
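The rounding step in the proof of Lemma 17 (sample each vertex of $G$ with probability $p_u$, then delete one endpoint of every surviving edge) can be sketched as follows. The function names, the seeded generator and the tiny example are assumptions; the code illustrates the procedure only and does not verify the spectral bound.

```python
import random


def marginals(f, n, N):
    """p_u = (1/N) * sum_j f[u][j] for an indicator matrix f of shape n x N."""
    return [sum(f[u][j] for j in range(N)) / N for u in range(n)]


def round_to_independent_set(n, edges_g, p, rng):
    """Sample I_0 with marginals p, then remove one endpoint per edge inside it."""
    picked = {u for u in range(n) if rng.random() < p[u]}
    for u, v in edges_g:
        if u in picked and v in picked:
            picked.discard(u)  # each removal is charged to a distinct edge inside I_0
    return picked


# Path 0 - 1 - 2 as G, and a hypothetical indicator f over [3] x [2].
edges_g = [(0, 1), (1, 2)]
f = [[1, 1], [0, 0], [1, 0]]
p = marginals(f, 3, 2)
result = round_to_independent_set(3, edges_g, p, random.Random(0))
print(p, sorted(result))
```

Since vertices are only ever removed, a single pass over the edges suffices, and the output has size at least $|I_0|$ minus the number of edges inside $I_0$.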


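In the remaining part, we prove the supporting claims.

Claim 18. $p^T A p = \frac{1}{N} f^T (A \otimes J_N) f$ where $J_N$ is the $N$-by-$N$ matrix of all $1/N$'s.

Proof. Let $e^{u,v}$ be the matrix whose entry at the $u$-th row and $v$-th column is 1, and all others 0. Notice that
$A = \sum_{u,v} A_{u,v} e^{u,v}$. Recall that $J_N$ is the $N$-by-$N$ matrix of all $1/N$'s. For any pair $(u, v) \in [n]^2$:

\[ p_u p_v = \frac{1}{N^2} \sum_{i,j} f_{(u,i)} f_{(v,j)} = \frac{1}{N} f^T \left( e^{u,v} \otimes J_N \right) f. \]

Hence

\[ p^T A p = \sum_{u,v} A_{u,v} p_u p_v = \frac{1}{N} \sum_{u,v} A_{u,v} f^T \left( e^{u,v} \otimes J_N \right) f = \frac{1}{N} f^T \left[ \Big( \sum_{u,v} A_{u,v} e^{u,v} \Big) \otimes J_N \right] f = \frac{1}{N} f^T (A \otimes J_N) f. \]

The second-to-last identity follows from the bi-linearity of the Kronecker product. □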
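The identity of Claim 18 can be checked exactly on a toy example. The sketch below evaluates the quadratic form $f^T (A \otimes J) f$ entry by entry with rational arithmetic; the function name and the example data are assumptions.

```python
from fractions import Fraction


def quad_form_kron(A, B, f):
    """Compute f^T (A kron B) f where f is indexed as f[u][i]."""
    n, N = len(A), len(B)
    return sum(
        A[u][v] * B[i][j] * f[u][i] * f[v][j]
        for u in range(n) for v in range(n)
        for i in range(N) for j in range(N)
    )


# Path on 3 vertices as G, N = 2, and a hypothetical 0/1 vector f.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
N = 2
J = [[Fraction(1, N)] * N for _ in range(N)]   # all entries equal to 1/N
f = [[1, 0], [0, 1], [1, 1]]

p = [Fraction(sum(row), N) for row in f]
lhs = sum(A[u][v] * p[u] * p[v] for u in range(3) for v in range(3))
rhs = Fraction(1, N) * quad_form_kron(A, J, f)
print(lhs, rhs)
```

Both sides evaluate to the same rational number on this example, as the claim predicts.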

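Claim 19. $f^T (A \otimes J_N) f \le \left| f^T A \otimes (B - J_N) f \right|$.

Proof. We have $f^T (A \otimes J_N) f = f^T A \otimes (J_N - B) f + f^T (A \otimes B) f$. As noted above, $f$ being the indicator of an independent
set implies $f^T (A \otimes B) f = 0$:

\[ f^T (A \otimes J_N) f = f^T A \otimes (J_N - B) f \le \left| f^T A \otimes (J_N - B) f \right|. \]

□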


Claim 20. $\left| f^T A \otimes (B - J_N) f \right| \le \frac{\rho\Delta}{d} \|f\|^2$.

Proof. Define $C := B - J_N$. For any symmetric matrix $M$, let $\rho(M)$ be its spectral radius,
$\rho(M) := \max_p \frac{|p^T M p|}{\|p\|^2}$. Observe that $\rho(M) = \max_i(|\sigma_i(M)|)$. We have:

\[ \left| f^T A \otimes (B - J_N) f \right| = \left| f^T (A \otimes C) f \right| \le \rho(A \otimes C) \|f\|^2. \]

We know that the spectrum of the Kronecker product of two symmetric matrices corresponds to the pairwise
products of the spectra of the corresponding matrices, i.e., all eigenvalues of $A \otimes C$ are of the form $\sigma_i(A) \cdot \sigma_j(C)$
for each $i$ and $j$. Therefore,

\[ \rho(A \otimes C) = \max_{i,j}(|\sigma_i(A) \sigma_j(C)|) = \max_i(|\sigma_i(A)|) \max_j(|\sigma_j(C)|) = \rho(A) \cdot \rho(C). \]

Observe that $\rho(A) \le \Delta$, since $A$ is the adjacency matrix of a graph with degree $\le \Delta$. Now we will upper
bound $\rho(C)$. Since $H$ is a regular graph and $B$ is its normalized adjacency matrix, the top eigenvector of
$B$ is the all 1's vector and the corresponding eigenvalue is 1. Therefore $C$ has the same eigenvectors as $B$. Moreover
$Ce = 0$, thus:

\[ \rho(C) = \max(|\sigma_i(C)| : 1 \le i \le N) = \max(|\sigma_i(B)| : 2 \le i \le N) = \max(\sigma_2(B), |\sigma_N(B)|) = \frac{\rho(H)}{d} \le \frac{\rho}{d}. \]

□
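The spectral fact used in the proof of Claim 20 rests on the following observation: if $Ax = \lambda x$ and $Cy = \mu y$, then $x \otimes y$ is an eigenvector of $A \otimes C$ with eigenvalue $\lambda\mu$. A minimal check on $2 \times 2$ matrices, with hypothetical helper names, is sketched below.

```python
def kron(A, C):
    """Kronecker product of two square matrices given as lists of rows."""
    n, N = len(A), len(C)
    return [
        [A[u][v] * C[i][j] for v in range(n) for j in range(N)]
        for u in range(n) for i in range(N)
    ]


def matvec(M, x):
    return [sum(m * t for m, t in zip(row, x)) for row in M]


def kron_vec(x, y):
    """Kronecker product of two vectors, matching the index order used in kron."""
    return [a * b for a in x for b in y]


A = [[0, 1], [1, 0]]
C = [[0, 1], [1, 0]]
x, lam = [1, -1], -1   # A x = -x
y, mu = [1, 1], 1      # C y = y
z = kron_vec(x, y)
print(matvec(kron(A, C), z), [lam * mu * t for t in z])
```

The two printed vectors coincide, showing that the eigenvalue of the product is the product of the eigenvalues.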

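We now prove the main theorem needed for our reduction.

Theorem 21. Given a graph $G = (V, E)$ with maximum degree $\Delta$, for any $\epsilon > 0$, we can construct in
polynomial time a triangle-free graph $\hat{G} = (\hat{V}, \hat{E})$ with

\[ \text{rel-IS}(G) \le \text{rel-IS}(\hat{G}) \le (1 + \epsilon)\, \text{rel-IS}(G). \]

Moreover $\hat{G}$ has (a) $\mathrm{poly}(\Delta, \epsilon^{-1}) |V|$ nodes, (b) degree $O(\Delta^3 \epsilon^{-2})$.

Proof. For any $d$ and $N$, it is known how to construct [29, 32], in deterministic polynomial time, an $O(d)$-regular
Ramanujan graph $H$ with girth $\Omega(\log_d N)$ and spectral radius at most $\rho \le O(\sqrt{d})$. Thus for some
choice of $d = O(\Delta^2 \epsilon^{-2})$ and $N = \mathrm{poly}(\Delta, \epsilon^{-1})$, we can find a $d$-regular graph $H$ with constant girth at least 4
(so that $H$ is triangle-free) and spectral radius $\rho \le d\epsilon/\Delta$. For such $H$, let $\hat{G} := G \otimes H$. We have
$\left(1 - \frac{\rho\Delta}{2d}\right)^{-1} \le (1 - \epsilon/2)^{-1} \le 1 + \epsilon$.

Proposition 16 implies $G \otimes H$ is triangle-free. By Lemma 17,

\[ \text{rel-IS}(G) \le \text{rel-IS}(G \otimes H) \le \left(1 - \frac{\rho\Delta}{2d}\right)^{-1} \text{rel-IS}(G) \le (1 + \epsilon)\, \text{rel-IS}(G). \]

Now we prove the remaining properties:

(a) $|V(G \otimes H)| = |V(G)| \cdot |V(H)| \le |V| \cdot \mathrm{poly}(\Delta, \epsilon^{-1})$.

(b) $d_{\max}(G \otimes H) \le d_{\max}(G) \times d_{\max}(H) \le O(\Delta d) = O(\Delta^3 \epsilon^{-2})$. □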

Note 22. Noga Alon has provided an alternate construction where one can obtain a triangle-free graph $\hat{G}$
such that $\text{rel-IS}(\hat{G}) = \text{rel-IS}(G)$. This, however, does not lead to an improved constant in our analysis. For the
sake of completeness, we include the alternate theorem in the Appendix (see Theorem 24).

We will end the section with the proof of Corollary 15. We need the following hardness result of [14]. It follows from
their Corollary 2.3 and Appendix 8 (weighted to unweighted reduction). As noted in prior work, the construction
produces bounded degree graphs.

Theorem 23 (Dinur, Safra [14]). For any constant $\epsilon > 0$, given any unweighted graph $G$ with bounded
degrees, it is NP-hard to distinguish between:

• (Yes) $\text{rel-IS}(G) > c - \epsilon$,

• (No) $\text{rel-IS}(G) < s + \epsilon$;

where $c$ and $s$ are constants such that $\frac{1 - s}{1 - c} \approx 1.36$.

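Proof of Corollary 15. Given a bounded degree graph $G$, consider the graph $\hat{G}$ given by Theorem 21 for some small constant
$\epsilon_0 < \epsilon$. Since $G$ is bounded degree and $\epsilon_0$ is constant, $\hat{G}$ is also bounded degree. Furthermore, $\hat{G}$ satisfies
$\text{rel-IS}(G) \le \text{rel-IS}(\hat{G}) \le (1 + \epsilon_0)\, \text{rel-IS}(G)$. Completeness follows immediately: $\text{rel-IS}(\hat{G}) \ge \text{rel-IS}(G) > c - \epsilon$. For the
soundness, suppose $\text{rel-IS}(\hat{G}) > s + \epsilon$. Then $\text{rel-IS}(G) \ge \frac{\text{rel-IS}(\hat{G})}{1 + \epsilon_0} > \frac{s + \epsilon}{1 + \epsilon_0}$, which is at least $s + \epsilon/2$ for suitable $\epsilon_0$;
since $\epsilon$ is arbitrary, this suffices. The hardness of Vertex
Cover follows from Proposition 13. □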





6 Conclusions 


In this paper we provide the first hardness of approximation for the fundamental Euclidean $k$-means problem.
Although our work clears a major hurdle of going beyond NP-hardness for this problem, there is still a big
gap in our understanding, with the best upper bound being a factor $(9 + \epsilon)$. We believe that our result and
techniques will pave the way for further work in closing this gap. Our reduction from vertex cover produces high
dimensional instances ($d = \Omega(n)$) of $k$-means. However, by using the Johnson-Lindenstrauss transform
we can project the instance onto $O(\log n/\epsilon^2)$ dimensions and still preserve pairwise distances by a factor $(1 + \epsilon)$
and the $k$-means cost by a factor of $(1 + \epsilon)^2$. We leave it as an open question to investigate inapproximability
results for $k$-means in constant dimensions. It would also be interesting to study whether our techniques give
hardness of approximation results for the Euclidean $k$-median problem. Finally, our hardness reduction in Section 5
provides a novel analysis by using the spectral properties of the underlying graph to argue about independent
sets in graph products; this connection could have applications beyond the present paper.

A Appendix

Theorem 24. Let $G = (V, E)$ be an arbitrary graph with maximum degree $\Delta$. It is possible to construct in
polynomial time a triangle-free graph $\hat{G}$ such that $\text{rel-IS}(\hat{G}) = \text{rel-IS}(G)$.

Before proving the theorem, we need the following standard facts about $(n, d, \lambda)$ graphs. The following
proofs are suggested by Noga Alon.

Lemma 25. Let $H = (U, F)$ be an $(n, d, \lambda)$ graph, assume $\lambda < d/4$ and let $B$ be a set of vertices of $H$. Let
$N(B)$ denote the set of all neighbors of $B$ in $H$. Then:

1. If \B\ > then |A^(i?)| > n — 

A If\B\ < then |tV(i?)| > 

Proof. Part 1 is proved in Corollary 1 in [2]. Part 2, for < \B\ < follows from the same corollary 
(which implies that in this range |A(_B)| > ^). For |i?| < ^^n, the result follows from the expander mixing 
lemma (see [3], corollary 9.2.5), as there are d\B\ edges between B and N(B). □ 

We now provide the proof of Theorem 24.

Proof. Let H = {U,F) be a (n, d, A)-expander with A < 2\/d — 10 Let G = G ® H. Further, let ^ > A. 
It is well known that such graphs exist. It is easy to see that $\text{rel-IS}(G \otimes H) \ge \text{rel-IS}(G)$, since any
independent set $S'$ in $G$ leads to an independent set $S' \times U$ in $G \otimes H$.

For the other direction, let $S \subseteq V \times U$ be an independent set in $G \otimes H$. Define

T = {v€V ■.\{u€U ■.{v,u)€ S}| > 4n} 

d 

By Lemma 25, part 1, $T$ is an independent set in $G$. Let $T'$ be a maximal (with respect to containment)
independent set in $G$ that contains $T$. By maximality, every vertex in $V \setminus T'$ has at least one neighbor in $T'$.
Thus $T'$ is a dominating set in $G$ and there is a collection of stars $\{S_v : v \in T'\}$, covering all the vertices of
$G$. As $T'$ is an independent set, $|T'| \le \text{rel-IS}(G)|V|$. To complete the proof it suffices to show that for each
of the stars $S_v$ in our collection whose set of vertices in $G$ is $V_v$,

\[ |\{(v', u) : (v', u) \in S, v' \in V_v\}| \le |U| = n. \qquad (1) \]

$^1$ This means that all eigenvalues of $H$, except the first, are bounded in absolute value by $\lambda$.

The number of leaves of the star $S_v$ is at most $\Delta$. For each such leaf $v'$, the set of vertices of $H$ given by

\[ B_{v'} = \{u \in U : (v', u) \in S\} \]
is of cardinality smaller than Moreover, all its neighbors in H cannot belong to the set By = {u G U : 
(vyu) G Sj where v is the center of the star Sy. By Lemma 1251 part 2, the number of these neighbors is at 
least ^ > A times the cardinality of Byi. This implies that the total size of all sets Byi where the sum 
ranges over all leaves $v'$ of $S_v$ is at most the number of vertices in $U - B_v$, implying (1) and completing the
proof. □




